GEM-PRO - Genes & Sequences¶

This notebook gives an example of how to run the GEM-PRO pipeline with a dictionary of gene IDs and their protein sequences.

Input: Dictionary of gene IDs and protein sequences

Output: GEM-PRO model

Imports¶

In [1]:

import sys
import logging

In [2]:

# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [3]:

# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging¶

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!

In [4]:

# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [5]:

# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project¶

Set these three things:

ROOT_DIR
- The directory where a folder named after your PROJECT will be created
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files

Note: Methods for protein complexes and metabolites are still in development.

In [6]:

# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_and_sequences_GP'
GENES_AND_SEQUENCES = {'b0870': 'MIDLRSDTVTRPSRAMLEAMMAAPVGDDVYGDDPTVNALQDYAAELSGKEAAIFLPTGTQANLVALLSHCERGEEYIVGQAAHNYLFEAGGAAVLGSIQPQPIDAAADGTLPLDKVAMKIKPDDIHFARTKLLSLENTHNGKVLPREYLKEAWEFTRERNLALHVDGARIFNAVVAYGCELKEITQYCDSFTICLSKGLGTPVGSLLVGNRDYIKRAIRWRKMTGGGMRQSGILAAAGIYALKNNVARLQEDHDNAAWMAEQLREAGADVMRQDTNMLFVRVGEENAAALGEYMKARNVLINASPIVRLVTHLDVSREQLAEVAAHWRAFLAR',
                       'b3041': 'MNQTLLSSFGTPFERVENALAALREGRGVMVLDDEDRENEGDMIFPAETMTVEQMALTIRHGSGIVCLCITEDRRKQLDLPMMVENNTSAYGTGFTVTIEAAEGVTTGVSAADRITTVRAAIADGAKPSDLNRPGHVFPLRAQAGGVLTRGGHTEATIDLMTLAGFKPAGVLCELTNDDGTMARAPECIEFANKHNMALVTIEDLVAYRQAHERKAS'}
PDB_FILE_TYPE = 'mmtf'

In [7]:

# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES, pdb_file_type=PDB_FILE_TYPE)

[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: /tmp/genes_and_sequences_GP: GEM-PRO project location
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes

Mapping sequence –> structure¶

Since the sequences have been provided, we just need to BLAST them to the PDB.

Note: These methods do not download any 3D structure files.

Methods¶

GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source]

BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).

Parameters:

seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
display_link (bool, optional) – Set to True if links to the HTML results should be displayed
outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False

In [8]:

# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: 2: number of genes with additional structures added from BLAST

Out[8]:

	pdb_id	pdb_chain_id	hit_score	hit_evalue	hit_percent_similar	hit_percent_ident	hit_num_ident	hit_num_similar
gene
b0870	3wlx	A	1713.0	0.0	1.0	1.0	333	333
b0870	3wlx	B	1713.0	0.0	1.0	1.0	333	333

Downloading and ranking structures¶

Methods¶

GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to each protein’s structures directory.

Parameters:	outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired force_rerun (bool) – If files should be re-downloaded if they already exist

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.

In [9]:

# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 18:11] [ssbio.pipeline.gempro] INFO: Saved 11 structures total

Out[9]:

	pdb_id	pdb_title	description	experimental_method	mapped_chains	resolution	chemicals	taxonomy_name	structure_file
gene
b0870	3wlx	Crystal structure of low-specificity L-threoni...	Low specificity L-threonine aldolase (E.C.4.1....	X-RAY DIFFRACTION	A;B	2.51	PLG	Escherichia coli	3wlx.mmtf
b0870	4lnj	Structure of Escherichia coli Threonine Aldola...	Low-specificity L-threonine aldolase (E.C.4.1....	X-RAY DIFFRACTION	A;B	2.10	EPE;MG;PLR	Escherichia coli	4lnj.mmtf

GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine=’needle’, always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source]

Set all representative structure for proteins from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

Homology model(s)

Ranked PDBs

BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:

seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
always_use_homology (bool) – If homology models should always be set as the representative structure
rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
allow_mutants (bool) – If mutations should be allowed or checked for
allow_deletions (bool) – If deletions should be allowed or checked for
allow_insertions (bool) – If insertions should be allowed or checked for
allow_unresolved (bool) – If unresolved residues should be allowed or checked for
clean (bool) – If structures should be cleaned
force_rerun (bool) – If sequence to structure alignment should be rerun

In [10]:

# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.

Out[10]:

	id	is_experimental	file_type	structure_file
gene
b0870	REP-3wlx	True	pdb	3wlx-A_clean.pdb
b3041	REP-1iez	True	pdb	1iez-A_clean.pdb

In [11]:

# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b0870').protein.representative_structure
my_gempro.genes.get_by_id('b0870').protein.representative_structure.get_dict()

Out[11]:

<StructProp REP-3wlx at 0x7f9a0ae345f8>

Out[11]:

{'_structure_dir': '/tmp/genes_and_sequences_GP/genes/b0870/b0870_protein/structures',
 'chains': [<ChainProp A at 0x7f99fbd55710>],
 'date': None,
 'description': 'Low specificity L-threonine aldolase (E.C.4.1.2.48)',
 'file_type': 'pdb',
 'id': 'REP-3wlx',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_structure_id': '3wlx',
 'resolution': 2.51,
 'structure_file': '3wlx-A_clean.pdb',
 'taxonomy_name': 'Escherichia coli'}

Creating homology models¶

For those proteins with no representative structure, we can create homology models for them. ssbio contains some built in functions for easily running I-TASSER locally or on machines with SLURM (ie. on NERSC) or Torque job scheduling.

You can load in I-TASSER models once they complete using the get_itasser_models later.

Info: Homology modeling can take a long time - about 24-72 hours per protein (highly dependent on the sequence length, as well as if there are available templates).

Methods¶

In [12]:

# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)

[2018-02-05 18:12] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 0 genes in folder /tmp/genes_and_sequences_GP/data/homology_models

Saving your GEM-PRO¶

Warning: Saving is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.

GEMPRO.save_json(outfile, compression=False): Save the object as a JSON file using json_tricks

In [13]:

import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)

[2018-02-05 18:12] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:12] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: genes_and_sequences_GP) to /tmp/genes_and_sequences_GP/model/genes_and_sequences_GP.json